When to Go, and When to Explore: The Benefit of Post-Exploration in Intrinsic Motivation. (arXiv:2203.16311v2 [cs.LG] UPDATED)
Go-Explore achieved breakthrough performance on challenging reinforcement
learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that
successful exploration requires an agent to first return to an interesting
state ('Go'), and only then explore into unknown terrain ('Explore'). We refer
to such exploration after a goal is reached as 'post-exploration'. In this
paper we present a systematic study of post-exploration, addressing open
questions that the original Go-Explore paper left unanswered. First, we isolate
the effect of post-exploration by turning it on and off within the same
algorithm. Subsequently, we introduce a new methodology to adaptively decide
when to post-explore and for how long. Experiments on a range
of MiniGrid environments show that post-exploration indeed boosts performance
(with a larger impact than tuning regular exploration parameters), and this
effect is further enhanced by adaptively deciding when and for how long to
post-explore. In short, our work identifies adaptive post-exploration as a
promising direction for RL exploration research.
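To make the 'Go' / 'post-explore' split concrete, the sketch below shows one possible episode loop. This is not the paper's implementation: the Gym-style environment interface, the `goal_reached` info flag, and the `post_explore_prob` / `max_post_explore_steps` parameters are illustrative assumptions.

```python
import random

def run_episode(env, policy, goal,
                post_explore_prob=0.5,      # assumed: chance to post-explore after reaching the goal
                max_post_explore_steps=10,  # assumed: upper bound on post-exploration length
                max_steps=200):
    """Sketch of one 'Go' + optional 'post-exploration' episode.

    'Go' phase: follow the goal-conditioned policy until the goal is reached.
    'Explore' phase: with some probability, continue with random actions
    to discover states beyond the reached goal.
    """
    trajectory = []
    obs = env.reset()
    info = {}

    # --- Go phase: head towards the goal with the current policy ---
    for _ in range(max_steps):
        action = policy(obs, goal)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward, next_obs))
        obs = next_obs
        if info.get("goal_reached") or done:
            break

    # --- Post-exploration phase: sometimes keep exploring from the goal ---
    if info.get("goal_reached") and random.random() < post_explore_prob:
        # Here the decision and duration are sampled at random; an adaptive
        # variant would derive them from the agent's learning signal instead.
        n_steps = random.randint(1, max_post_explore_steps)
        for _ in range(n_steps):
            action = env.action_space.sample()  # undirected exploration step
            next_obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward, next_obs))
            obs = next_obs
            if done:
                break

    return trajectory
```

In this sketch the fixed probability and randomly sampled length stand in for the adaptive decisions studied in the paper, which choose when and for how long to post-explore based on the agent's current state of learning.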